Abstract

Background: The locations of the TM segments inside the membrane proteins are the consequence of a cascade 
of several events: the localizing of the nascent chain to the membrane, its insertion through the translocon, and the 
conformation adopted to reach its stable state inside the lipid bilayer. Even though the hydrophobic h-region of 
signal peptides and a typical TM segment are both composed of mostly hydrophobic side chains, the translocon 
has the ability to determine whether a given segment is to be inserted into the membrane. Our goal is to acquire 
robust biological insights into the influence of the translocon on membrane insertion of helices, obtained from the 
in silico discrimination between signal peptides and transmembrane segments of bitopic proteins. Therefore, by 
exploiting this subtle difference, we produce an optimized scale that evaluates the tendency of each amino acid to 
form sequences destined for membrane insertion by the translocon. 

Results: The learning phase of our approach is conducted on carefully chosen data and easily converges on an 
optimal solution called the PMIscale (Potential Membrane Insertion scale). Our study leads to two striking results. 
Firstly, with a very simple sliding-window prediction method, PMIscale enables an efficient discrimination between 
signal peptides and signal anchors. Secondly, PMIscale is also able to identify TM segments and to localize them 
within protein sequences. 

Conclusions: Despite its simplicity, the localization method based on PMIscale attains nearly the same level of
TM topography prediction accuracy as current state-of-the-art prediction methods. These observations confirm
the prominent role of the translocon in the localization of TM segments and suggest several biological hypotheses 
about the physical properties of the translocon. 
Background

The proteins transported into the endoplasmic reticulum 
(ER) include transmembrane (TM) proteins which be- 
come embedded in the ER membrane, and water-soluble 
proteins which are fully translocated across the ER mem- 
brane and released into the ER lumen. Proteins are guided 
to the ER while they are synthesized on the ribosome by a 
protein complex - the Signal Recognition Particle - that 
recognizes a targeting signal localized in the growing poly- 
peptide. The targeting signals are either N-terminal signal 
sequences called signal peptides (SP) or, in the case of 
many membrane proteins that lack signal peptides, the
first TM segment which is called a signal anchor. Insertion 
into the ER is then mediated by an evolutionarily con- 
served membrane protein complex, the translocon. This 
protein conduction channel provides a passage for pro- 
teins across the membrane as well as a means to integrate 
nascent proteins into the membrane through a lateral exit 
gate. When this gate is opened, TM segments may move 
from the aqueous interior of the channel into the lipid 
phase of the membrane. Finally, the stably folded membrane
protein reaches a minimum free energy inside the
lipid bilayer. A large number of computational methods
are available for detecting signal peptides (SignalP [1],
Signal-3L [2], Signal-CF [3], PrediSi [4]) or localizing TM
segments (TMHMM [5], Phobius [6], MemBrain [7]). 
Complete reviews of the advantages and shortcomings of
these methods are available in [8,9]. Only a few of them,
such as iPSORT [10] or SOMRuler [11], concentrate on making
the relationships between amino acid sequences and
signal peptides or TM helices (TMH) transparent.

The objective of our study is to evaluate the influence 
of the translocon on the partitioning of membrane seg- 
ments. In 2005, Hessa et al. attempted to elucidate this 
phenomenon of selective helical TM segment movement 
through the ER translocon site with a series of experi- 
ments. They developed an experimental system that 
makes it possible to measure the membrane insertion ef- 
ficiency of a large set of hydrophobic model segments 
[12,13]. These studies suggested that insertion or not of 
a helical TM segment is fundamentally a problem of 
equilibrium thermodynamics for most of the TM seg- 
ments. According to this hypothesis, the membrane in- 
sertion of a TM segment mainly depends on the local 
contribution of each amino acid inside the translocon 
channel. Moreover, it also suggested a strong position- 
dependence within the hydrophobic segment for each of 
the 20 amino acids. Nevertheless, this approach, which
leads to the so-called biological hydrophobicity scale
(BH), is based solely on variations of an engineered
TM segment included in the protein leader peptidase
(Lep).

In order to benefit from the BH scale, two translocon-based
prediction tools have been developed to predict
the localization of TM segments: ΔG predictor [13] and
SCAMPI [14]. These tools are based on the same calculation
of the free energy cost of insertion but they use
different algorithms. Moreover, whereas SCAMPI calculates
the energy of peptides with a fixed length of 21
amino acids, ΔG predictor allows length corrections.
Even though ΔG predictor and SCAMPI are outperformed
in prediction accuracy by tools such as
OCTOPUS [15] or Philius [16], such methods are nevertheless
extremely useful as they attempt to give clues to
explain biological observations and help us to understand
more precisely the mechanism governing the insertion
of helical TM segments. In the future, an accurate prediction
could identify segments that are borderline in their clas- 
sification, and that are therefore able to switch between 
TM and non-TM configuration states. Such switches are 
known to occur in a number of cases but it is difficult to 
evaluate their prevalence at this time [17]. Finally, a de- 
tailed understanding of topology determinants can lead 
to the design of hydrophobic helices with biomedical 
applications. 

In their work, Hessa et al. do not focus on the prob- 
lematic aspects of signal peptides and unfortunately, 
tools developed from their results have difficulty in dif- 
ferentiating them from TM segments. Nevertheless, even 
though the central region of signal peptides - the
h-region - and a typical TM segment are both composed
of mostly hydrophobic side chains, the translocon has 
the ability to sort them. If the translocon can determine 
whether or not a given segment should be inserted into 
the membrane, we can expect that essential elements 
promoting the phenomenon could be captured by in 
silico exploitation of the difference between the amino 
acid composition of the hydrophobic core of signal pep- 
tides and TM segments. Such an approach could benefit 
from a large number of learning datasets but these data 
must be chosen carefully. Although the UniprotKB/SwissProt
annotations cannot be regarded as experimentally
established topography data, we chose this databank to
construct our training data set because it allows the selection
of only eukaryotic proteins with a type II or type
III signal anchor annotation. To decrease the bias introduced
by the TM prediction tools that may
be the origin of annotations in UniprotKB/SwissProt, we
treated the annotated positions of TM segments as approximate
rather than exact. Several studies have shown that some TM segments
in polytopic proteins need to cooperate during
the membrane insertion step [18]. The exclusion of poly- 
topic proteins from the training data eliminates TM seg- 
ments that depend on other parts of the protein for 
efficient insertion and folding. We insist on this restric- 
tion because the inclusion of polytopic proteins in train- 
ing data may compromise the prediction accuracy of 
bitopic proteins and vice-versa. 

The in silico elaboration of a new scale can be consid- 
ered as an optimization problem and a local search ap- 
proach is an effective technique to solve it. In previous 
work [19,20], our scale assigned a symmetric curve to 
each amino acid in order to take into account its pos- 
ition inside the translocon, but we only partially suc- 
ceeded in obtaining stable curves. In this study, we 
instead assign an average value for each amino acid and 
arrive at the striking observation that, with very few pa- 
rameters, this new scale, called PMIscale, obtains quite 
good results, both for discriminating SP from TM seg- 
ments and for capturing a large part of the information 
required to locate TM segments along the membrane 
proteins. 

Results and discussion 

The new PMIscale 

The local search algorithm used to find a new scale that 
discriminates signal peptides from signal anchors quickly 
converges on an optimal solution that results in the 
PMIscale displayed in Table 1. A high value corresponds 
to a high preference for TM segment insertion. 

When compared with other hydrophobicity scales, the
PMIscale indicates that the aromatic side chains Trp, Tyr,
and Phe promote membrane insertion more
efficiently. These results

Table 1 Amino acid PMIscale values

A    1.8    G   1.6    M   1.9    S  -1.8
C    2.1    H  -0.2    N  -3.5    T   2.3
D   -8.5    I   6.5    P  -4.6    V   4.2
E  -11.5    K   0.1    Q  -4.5    W   5.7
F    5.8    L   3.8    R   0.5    Y   6.1

suggest that these amino acids participate strongly in the 
recognition of TM helices by the translocon. In addition, 
PMIscale does not penalize basic amino acids - Arg, 
His, Lys - as much as the other scales even though the 
lipid bilayer does not favor their presence, suggesting 
that the translocon may also play a major role in the in- 
sertion of these amino acids. This result agrees with the 
computer simulations of a helix containing an argin- 
ine sidechain conducted specifically to consider the 
sidechain moving from the translocon to the lipid bi- 
layer [21]. 

Evaluation of the discrimination between SP and TM 

As shown in Table 2, the PMIscale enables significantly
better discrimination between SP and TM than other
scales when a sliding-window approach is used as described
in the Methods section. The quality of such a
classification system can be evaluated by the Area Under 
the ROC curve (AUC) [22]. The performance of PMIscale 
for discriminating SP from signal anchors is excellent for 
the SWPTest dataset (AUC = 0.932) and, unlike other 
scales, it also exhibits suitability for the task of discrimin- 
ating SP from TM segments as shown by our benchmark 
for the PDBTMSeg dataset (AUC = 0.803). For information
purposes, Additional file 1: Table S1 also provides insights
into the effectiveness of PMIscale compared with
two widely used machine learning-based methods [1,6].

Prediction of membrane proteins in proteome-wide 
studies 

If PMIscale conveys relevant information about the 
translocon mechanisms, it should also be able to predict 
accurately whether a protein is a membrane protein or 



Table 2 The AUC quality assessment of the discrimination
between SP and TM segments on several datasets

Scale          SWPTest   ScampiHigh   ScampiLow   PDBTMSeg
PMIscale       0.932     0.86         0.845       0.803
K&D            0.829     0.662        0.691       0.636
GES            0.793     0.736        0.737       0.667
BH-2005        0.895     0.752        0.733       0.676
TM tendency    0.887     0.792        0.814       0.756
AvgH           0.837     0.706        0.726       0.67

The scales shown are from the following references: K&D [23]; GES [24]; BH [12];
TM tendency [25]; AvgH [26].



not. It was previously observed in [27] that the energy 
required for the insertion of the TM segment of a bito- 
pic protein must be higher than the energy required by 
the insertion of the following TM segments. In our ap- 
proach, we extended this observation with the notion of 
'first TM segment' - the TM segment of a bitopic pro- 
tein or the first TM segment of a polytopic protein - and 
we introduced two thresholds in the TM localization algorithm:
τfirst for the insertion of the first TM segment,
deduced from the threshold that separates signal peptides
from signal anchors, and τnext for the following
TM segments. In addition, it was also observed that
in vitro, the SRP binding to the ribosome nascent chain
declines when the nascent chain reaches a length of
110-140 amino acids [28]. Therefore, when evaluating
methods based on a sliding-window approach, we added
the constraint that the signal anchor is not located after
that limit. Consequently, τfirst determines whether a protein
is a membrane protein or not for only that limited
N-terminal part of the protein.

A guideline to proteome-wide α-helical membrane
protein topology has been published recently [29], giving
the opportunity to compare the PMIscale predictions
with 18 algorithms on control datasets. We evaluated
PMIscale on two benchmark datasets extracted from this
work that permit evaluation of membrane-inserted proteins.
We also performed a comparison with the ΔG predictor
method, because this method is directly based on
the Hessa et al. [12] biological scale. The first dataset is
composed of cytosolic proteins without any signal peptide
or TM segment. For this dataset, the PMIscale-based
predictor predicts 2.8% of proteins with at least one
TM segment, which places it among the three best
methods, 12% better than the average performance of
the programs evaluated in [29]. The second dataset is
composed of extracellular proteins that contain a signal
peptide but no TM segment. The PMIscale predicts 10.2% of
proteins with at least one TM segment, which is
significantly better (by 30%) than the average
performance of the 18 more sophisticated methods. We
can note that over-prediction errors are much less abundant
with PMIscale than when other hydropathy plot
methods are used, placing it at the same level as the best
methods: Phobius [30], Philius [16] and Polyphobius [31].
The ΔG predictor is not adapted to this situation (it predicts
70% of proteins as having at least one TM segment),
which indicates that the BH scale may not differentiate
signal peptides from TM segments.

The last benchmark dataset we tested was extracted 
from the benchmark server developed by Rath et al. [32] 
which offers general and specialized assessment of exist- 
ing and novel membrane helix prediction methods. In 
this dataset, the SP was cleaved from the mature protein. 
We evaluated the standard benchmark referred to as the 
TMH_l/2MH_OPM_BB_SOLB dataset (which consists
of sequences having less than 30% similarity to each 
other possessing membrane helices that are long enough 
to traverse the membrane and known membrane helices 
that do not traverse the entire membrane). PMIscale 
achieves a good performance that is 21% better than the 
average of the 53 available methods. Performance result 
details are available in Additional file 1: Table S2. 

Prediction of TM localization in membrane proteins 

The initial objective of PMIscale was to provide infor- 
mation about the translocon passage. However, our ex- 
periments demonstrated that this scale is also able to 
localize TM segments in protein sequences. To evaluate 
this point, we used two benchmark datasets. One is 
composed of 1311 G-protein coupled receptors (GPCRs) 
extracted from [29]. The particularity of this data set is 
that the topology of GPCRs is challenging to predict, as 
several of the TM helices are uncharacteristically hydro- 
philic. In our benchmark, the prediction is regarded as 
correct if it contains all 7 and only 7 TM segments. In
this case, PMIscale performs below the
average of the evaluated methods, with only 37% of the
proteins predicted with 7 TM segments. This is not surprising
because TM helices in the case of the GPCRs
probably cannot be predicted accurately by any method limited
to the composition of the individual TM segments
alone. The usefulness of high accuracy prediction of
transmembrane inter-helix contacts has been demon- 
strated in this particular protein family [33]. In this more 
challenging case, sophisticated methods command an 
advantage because they additionally extract information 
from global features of the sequences, rather than using 
only local features of the TM segments. Additional glo- 
bal information about the positively charged residues of 
the alternate sides of the membrane and the general bias 
of the charges between regions of the proteins has also 
been proven to be useful [34]. Moreover, with this par- 
ticular family, the prediction power can be improved by 
multiple sequence alignment information. We also note 
that prediction performances on this particular dataset 
vary a lot between algorithms. 

The second dataset was a standard benchmark data 
subset suggested by [32], that we used to compare our 
novel PMIscale to 52 transmembrane helix prediction 
methods freely available to be run in batch mode. The 
evaluation was limited to topography scores, i.e. the ac- 
curacy per protein sequence and the accuracy per seg- 
ment. The results of specificity and the percentage of 
correctly predicted proteins show that the performance 
of PMIscale is significantly better than average. However, 
PMIscale's performance for sensitivity is slightly lower 
than the average of the methods. The results are sum- 
marised in Table 3, and a detailed comparison with each 



Table 3 Benchmark measures on a dataset extracted from
the Rath et al. benchmark web server

Method                                Sensitivity (%)   Specificity (%)   Correctly predicted sequences (%)
PMIscale                              80                90                88
Averaged performance of 52 methods    82.6              70.7              66.2

This benchmark contains 599 sequences - 133 membrane proteins TMH or
β-barrel and 466 soluble proteins - including 483 membrane helices; the
averaged performance was calculated on the 52 transmembrane helix
prediction methods with topography information available on the web server -
TMLOOP method is not taken into account.



method is available as supplementary data in Additional 
file 1: Table S2. We also used this dataset to evaluate the 
PMIscale-based predictions when the length of the slid- 
ing window is modified and we observe a moderate deg- 
radation of the performance when the window is set to 21 
or 25 amino acids (shown in Additional file 1: Table S3). 
Finally, we measured the impact on prediction accuracy 
when the values of the thresholds τfirst and τnext are
modified. The results show that higher values improve the 
specificity of predictions whereas lower values are able to 
identify all the membrane proteins with very few excep- 
tions. Performance results for modifications in thresholds 
are available in Additional file 1: Table S4. 

Conclusions 

PMIscale is able to distinguish signal peptides from TM 
segments as the translocon does. Moreover, accuracies 
obtained by the PMIscale on all the benchmark datasets 
are close to those of the most accurate and sophisticated 
methods. This occurs despite the fact that our method is 
based on a simple algorithm and has only 22 parameters -
20 values from the PMIscale and 2 thresholds, τfirst and
τnext. Information used in the predictions here is strictly
limited to the amino acid composition of the protein seg- 
ment and is derived from the bias observed between the 
composition of the signal peptides and the signal-anchor 
segments. Compared to usual sliding-window approaches 
used to precisely localize where the TM segments are, im- 
provements in predictions are due to the new scale, and 
also due to the introduction of a threshold term which dif- 
ferentiates the first segment. 

Our in silico results are consistent with the experi- 
mental results of Hessa et al., as they suggest that the 
translocon passage is the major factor that influences 
the TM segment positions. Nevertheless, we can also note 
that taking into account the position of the amino acids 
inside the translocon does not give rise to much predictive 
benefit in the comparison of the performances of the 
SCAMPI, AG predictor and PMIscale methods. 

Some particular protein families such as the GPCRs 
require more specific algorithms for the precise locali- 
zation of their TM segments. However, PMIscale could 
be very helpful for proteome-scale or genome-scale
studies: the PMIscale-based sliding-window predictor is
easy to use, quick and efficient, which is important for
large-scale genome processing. Moreover, if the objective
of a prediction is to elaborate target lists that either exclude
or specifically select integral membrane proteins,
as is sometimes required in structural genomics projects,
it is easy to modify the thresholds τfirst and τnext
to adjust the resulting inclusivity levels. An online
service for individual predictions and a stand-alone
PMIscale package for genome-scale predictions, written
in Perl, are provided on the web site
http://wwwappli.nantes.inra.fr/bioinfoweb.

Methods 

Selection of the data sets 

The major problem with the training data sets of TM 
prediction methods is the small number of membrane 
proteins in the PDB database [35]. This proportion is 
less than 2% according to the PDBTM database [36]. It 
has also been shown that the commonly chosen test sets 
are biased and, consequently, the reliability of the pre- 
dictors could be lower than reported [37]. Even though 
the UniprotKB/SwissProt annotations cannot be regar- 
ded as experimentally established topography data, we 
chose this database and retrieval tools [38] to construct 
our training dataset because it allows the selection of ex- 
clusively eukaryotic proteins with a signal anchor anno- 
tation. A signal anchor serves the purpose of the ER 
targeting as does a signal peptide, but it inserts into the 
membrane while the signal peptide does not. For our 
purpose, we selected reviewed eukaryotic proteins marked 
with a "Signal-anchor for type II membrane protein" or 
"Signal-anchor for type III membrane protein" annotation 
[6]. We added the 10 adjacent amino acids to the TM seg- 
ment or fewer if the number of adjacent amino acids was 
lower than 10. CD-HIT [39] was then used to obtain a 
final non-redundant protein set with an identity cut- 
off at 30%. 

The signal peptide dataset was extracted from SwissProt 
with only eukaryotic proteins marked as "verified experi- 
mentally". This dataset, limited to the first 60 amino acids 
of each protein, was submitted to the CD-HIT program to 
obtain a non-redundant dataset with an identity cut-off at 
30%. After this step, we obtained 1765 sequences with sig- 
nal peptides in the SPexp dataset. One part of the SPexp 
dataset - 1000 sequences - and the totality of the 435 TM 
segments were divided at random into one training data- 
set called SWPLearning (305 TM, 700 SP) - Additional 
file 2 - and one test dataset called SWPTest (130 TM, 
300 SP) - Additional file 3. 

Our method is benchmarked using the SCAMPI datasets
[14] and another dataset derived from a recent extraction of
TM protein segments from the PDBTM database [40],
for which reduction to a non-redundant set was performed
with an identity cut-off of 30% and completed by
the remaining 765 sequences from the SPexp dataset.
The resulting datasets were referred to as ScampiHigh -
Additional file 4 - and ScampiLow - Additional file 5 -
respectively for SCAMPI TM datasets with high- or
low-resolution data, and PDBTMSeg - Additional file 6.
It is important to note that there is no redundancy be- 
tween the PDBTMSeg dataset and the SWPLearning and 
SWPTest datasets. Finally several sequence selections from 
the 'Benchmark of membrane helix predictions from se- 
quences' site [32] were used to evaluate PMIscale on PDB 
datasets. 

Local search algorithm for averaged values 

Our algorithm, used to determine if a training segment 
is an SP segment or a TM segment, is based on a 
sliding-window approach. The value of a window is cal- 
culated as: 

$$H_i = \sum_{j=i}^{i+n-1} h(r_j) \tag{1}$$

where i is the position of the first residue within the sliding
window, r_j is the residue at position j in the sequence,
n is the length of the fixed window, and h(r_j) is the PMI
value of that residue. We define the PMI value of a sequence as the
maximum value obtained when sliding the window along
the sequence. If this value exceeds a threshold τfirst, the
sequence is considered as a TM segment. Otherwise, it is
considered as a signal peptide.
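
The window calculation of Eq. (1) and the resulting SP/TM decision can be illustrated with a minimal Python sketch. It is not the authors' Perl package. The scale values are transcribed from Table 1; the function names are hypothetical. One point is an assumption: the window sum is divided by n before being compared with the threshold, because the reported threshold (2.7) is on the order of single-residue values rather than of a 23-residue sum.

```python
# PMIscale values transcribed from Table 1.
PMI_SCALE = {
    "A": 1.8, "G": 1.6, "M": 1.9, "S": -1.8,
    "C": 2.1, "H": -0.2, "N": -3.5, "T": 2.3,
    "D": -8.5, "I": 6.5, "P": -4.6, "V": 4.2,
    "E": -11.5, "K": 0.1, "Q": -4.5, "W": 5.7,
    "F": 5.8, "L": 3.8, "R": 0.5, "Y": 6.1,
}


def window_sums(sequence, n=23, scale=PMI_SCALE):
    """Return H_i of Eq. (1) for every full window of length n."""
    values = [scale[residue] for residue in sequence]
    return [sum(values[i:i + n]) for i in range(len(values) - n + 1)]


def pmi_value(sequence, n=23, scale=PMI_SCALE):
    """Maximum window value along the sequence.

    Assumption: the sum is divided by n so that it is comparable
    with a threshold such as 2.7.
    """
    sums = window_sums(sequence, n, scale)
    if not sums:
        raise ValueError("sequence is shorter than the window")
    return max(sums) / n


def classify(sequence, threshold=2.7, n=23):
    """Label a segment 'TM' if its PMI value exceeds the threshold, else 'SP'."""
    return "TM" if pmi_value(sequence, n) > threshold else "SP"
```

In this sketch a residue that is absent from the scale (for example X) raises a KeyError; how such residues should be scored is not stated in the text.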

PMI values are optimized by a local search method in 
order to obtain the best discrimination between SP and 
TM segments. Local search algorithms are modern heur- 
istic methods designed for tackling hard optimization 
problems (see [41] for a review of these methods and 
their applications). 

A local search algorithm starts with an initial candi- 
date solution of the given search space and iteratively 
moves from the current solution to a neighboring solu- 
tion that improves the function that must be optimized. 
At each step of the local search algorithm, all candidate 
neighbors of the current solution are evaluated. Accor- 
ding to the steepest hill-climbing strategy, the best so- 
lution among the neighbors is chosen to replace the 
current solution, and the local search process is iterated 
from this new solution. The quality of a neighbor solu- 
tion is assessed by an evaluation function based on the 
Area Under the ROC Curve (AUC) [22] with the ROCR 
package [42]. It estimates the ability of the solution to 
obtain a suitable discrimination between SP and TM 
segments. 
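
The evaluation function can be illustrated as follows. The study computes the AUC with the ROCR package; the sketch below is a stand-in that uses the equivalent rank-based definition (the probability that a randomly chosen TM segment scores higher than a randomly chosen SP, ties counting one half). The function name and argument names are hypothetical.

```python
def auc(tm_scores, sp_scores):
    """Area under the ROC curve from two lists of PMI values.

    Counts, over all (TM, SP) pairs, how often the TM score is the
    larger one; a tie contributes one half.
    """
    if not tm_scores or not sp_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for t in tm_scores:
        for s in sp_scores:
            if t > s:
                wins += 1.0
            elif t == s:
                wins += 0.5
    return wins / (len(tm_scores) * len(sp_scores))
```

A value of 1.0 means that every TM segment scores above every SP, and 0.5 corresponds to no discrimination. The pairwise loop is quadratic, which is acceptable for datasets of a few hundred sequences such as those used here.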

A candidate solution is a set of 20 PMIscale values, 
one for each amino acid. The initial PMIscale values are 



Tessier et al. BMC Bioinformatics 2014, 15:156 
http://www.biomedcentral.com/1471-2105/15/156






set with the Kyte & Doolittle hydrophobic indexes. In a
candidate solution, a PMIscale value is treated in three
different ways to obtain a neighboring solution: it can be
kept unchanged, increased, or decreased by a delta
variation. Candidate neighbor solutions are obtained by
combining these possible transformations of each
PMIscale value.

Furthermore, we must keep in mind that we consider
the localization information given by the SwissProt databank
as approximate. To allow the movement of the
window of maximum value along the sequence, large
delta variations (+3, -3) are tested in the
first iteration of the local search. These delta variations
were gradually decreased during the following iterations.
A systematic search including modifications of the 20
amino acid values at each step would be too time-consuming.
Therefore, to overcome this limitation, three
groups of amino acids were defined. The local search
process first deals with the first group, G1, and determines
the optimal values for the amino acids of G1. It
then searches for the optimal values for the amino acids of
the second group, G2, and finally deals with the third
group, G3, in the same way. Results presented in this
paper are obtained with the groups G1 = {F,L,I,V,Y,W},
G2 = {A,T,D,E,R,G,H} and G3 = {C,K,S,M,N,P,Q}, which is
the best grouping that has been tested. Nevertheless, we
can note that several runs executed with different amino
acid groupings gave similar results. Learning performance
also varies slightly when the length of the fixed window
is changed from 21 to 25 (the AUC decreases by less than 6%).
Nevertheless, the best
performance is obtained with a fixed length equal to 23
amino acids.
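
The group-wise steepest hill-climbing described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the exact delta schedule and stopping rule are not given in the text, so the sketch processes one group at a time and, within a group, applies each delta of a decreasing list until no neighbor improves the AUC. The window score is averaged over n (an assumption), and all function names are hypothetical.

```python
from itertools import product


def pmi_value(sequence, scale, n):
    """Maximum window mean along the sequence (normalization by n assumed)."""
    values = [scale[residue] for residue in sequence]
    return max(sum(values[i:i + n]) for i in range(len(values) - n + 1)) / n


def auc(tm_scores, sp_scores):
    """Rank-based AUC; ties count one half."""
    wins = sum(1.0 if t > s else 0.5 if t == s else 0.0
               for t in tm_scores for s in sp_scores)
    return wins / (len(tm_scores) * len(sp_scores))


def evaluate(scale, tm_seqs, sp_seqs, n):
    """Evaluation function: AUC of the SP/TM discrimination for one scale."""
    return auc([pmi_value(s, scale, n) for s in tm_seqs],
               [pmi_value(s, scale, n) for s in sp_seqs])


def best_neighbor(scale, group, delta, tm_seqs, sp_seqs, n):
    """Evaluate all 3**len(group) neighbors and return the best one.

    Each value of the group is decreased by delta, kept, or increased
    by delta. Returns the input scale itself if nothing improves.
    """
    best_scale, best_auc = scale, evaluate(scale, tm_seqs, sp_seqs, n)
    for moves in product((-delta, 0.0, delta), repeat=len(group)):
        candidate = dict(scale)
        for amino_acid, move in zip(group, moves):
            candidate[amino_acid] += move
        score = evaluate(candidate, tm_seqs, sp_seqs, n)
        if score > best_auc:
            best_scale, best_auc = candidate, score
    return best_scale


def local_search(initial_scale, groups, deltas, tm_seqs, sp_seqs, n=23):
    """Steepest hill-climbing, one amino acid group after the other."""
    current = dict(initial_scale)
    for group in groups:            # e.g. G1, then G2, then G3
        for delta in deltas:        # assumed decreasing, e.g. (3.0, 1.0, ...)
            while True:
                candidate = best_neighbor(current, group, delta,
                                          tm_seqs, sp_seqs, n)
                if candidate is current:   # no improving neighbor
                    break
                current = candidate
    return current
```

With a group of 6 or 7 amino acids, each step evaluates 729 or 2187 neighbors, which shows why modifying all 20 values at once (3^20 neighbors) would be impractical.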

The localization of TM segments 

To determine the localization of the TM segments, we
developed a straightforward algorithm. A sliding window
of fixed length (n = 23 residues, consistent with the
learning dataset) is scanned across the protein sequence
and a PMI value is calculated with Eq. (1) at each position
along the sequence. The first window position that
gives a PMI value above τfirst localizes the first TM
segment. The iterative process continues to localize the
other segments with a threshold τnext. Moreover, at
least two amino acids must separate two consecutive TM
segments.
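
A minimal Python sketch of this scan is given below. It is an illustration, not the distributed Perl package. It assumes that the window value is the mean of the residue values (so that it is comparable with thresholds such as 2.7 and 2.1), that the search resumes right after the required gap, and that segments are reported as half-open (start, end) index pairs. The function name and the min_gap parameter name are hypothetical. The N-terminal length limit on the first segment mentioned earlier in the article is not implemented.

```python
# PMIscale values transcribed from Table 1.
PMI_SCALE = {
    "A": 1.8, "G": 1.6, "M": 1.9, "S": -1.8,
    "C": 2.1, "H": -0.2, "N": -3.5, "T": 2.3,
    "D": -8.5, "I": 6.5, "P": -4.6, "V": 4.2,
    "E": -11.5, "K": 0.1, "Q": -4.5, "W": 5.7,
    "F": 5.8, "L": 3.8, "R": 0.5, "Y": 6.1,
}


def locate_tm_segments(sequence, tau_first=2.7, tau_next=2.1,
                       n=23, min_gap=2, scale=PMI_SCALE):
    """Return predicted TM segments as (start, end) pairs, end exclusive.

    The first segment must exceed tau_first; later segments only need
    to exceed tau_next. Consecutive segments are separated by at least
    min_gap residues.
    """
    values = [scale[residue] for residue in sequence]
    segments = []
    threshold = tau_first
    i = 0
    while i + n <= len(values):
        window_mean = sum(values[i:i + n]) / n   # normalization assumed
        if window_mean > threshold:
            segments.append((i, i + n))
            threshold = tau_next                 # relaxed for later segments
            i += n + min_gap                     # skip the segment and the gap
        else:
            i += 1
    return segments


# Toy sequence: two 23-residue leucine stretches in a serine background.
toy = "S" * 10 + "L" * 23 + "S" * 10 + "L" * 23 + "S" * 5
print(locate_tm_segments(toy))
```

Because the sketch takes the first window that exceeds the threshold rather than the best-scoring one, the reported boundaries can start a few residues before the hydrophobic stretch, as the toy sequence shows.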

A threshold value equal to 2.7 equilibrates the confusion
between signal peptides and signal anchors on the
SWPLearning dataset, i.e. the number of SP predicted as
signal anchors is roughly equal to the number of signal
anchors predicted as SP. The minimal threshold τfirst
required to predict the insertion of the first TM segment
was extrapolated from this observation. Next, we evaluated
τnext with the PDBTMSeg dataset. We chose to
optimize the specificity rather than the sensitivity parameter,
because our hypothesis is that some TM segments
require help to insert into the membrane and so
it is expected that some TM segments will be missed. The specificity
value is set at the level of the best performing prediction
methods, i.e. 0.96, which leads to τnext = 2.1.
All our benchmarks in this paper have been performed
only with these two thresholds: τfirst = 2.7 and τnext =
2.1. The evolution of predictions according to the thresholds
τfirst and τnext is available in Additional file 1: Table S4.
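
The balancing criterion used for the first threshold can be sketched as follows. This is a hypothetical helper, not the authors' procedure: it assumes that each training sequence has already been reduced to its PMI value, and it simply picks, among the observed scores, the threshold for which the number of SP predicted as signal anchors is closest to the number of signal anchors predicted as SP.

```python
def balanced_threshold(sp_scores, tm_scores):
    """Threshold that best equalizes the two kinds of confusion.

    A sequence is predicted as a signal anchor when its score is
    strictly above the threshold, and as a signal peptide otherwise.
    """
    candidates = sorted(set(sp_scores) | set(tm_scores))

    def imbalance(threshold):
        sp_as_anchor = sum(score > threshold for score in sp_scores)
        anchor_as_sp = sum(score <= threshold for score in tm_scores)
        return abs(sp_as_anchor - anchor_as_sp)

    # min() keeps the lowest candidate when several are equally balanced.
    return min(candidates, key=imbalance)


# Toy scores, invented for illustration only.
sp_example = [1.0, 1.5, 2.0, 3.0]
tm_example = [2.5, 3.5, 4.0, 2.2]
print(balanced_threshold(sp_example, tm_example))
```

On the toy scores, the selected threshold leaves one SP above it and one signal anchor at or below it, which is the balanced situation described in the text.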

Availability of supporting data 

Several data sets supporting the results of this article are 
included within the article and its additional files. 
of the τfirst and τnext parameters.